A phenotype driven integrative framework uncovers molecular mechanisms of a rare hereditary thrombophilia

Antithrombin resistance is a rare subtype of hereditary thrombophilia caused by prothrombin gene variants, leading to thrombotic disorders. Recently, the Prothrombin Belgrade variant has been reported as a specific variant that leads to antithrombin resistance in two Serbian families with thrombosis. However, due to clinical data scarcity and the inapplicability of traditional genome-wide association studies (GWAS), a broader perspective on molecular and phenotypic mechanisms associated with the Prothrombin Belgrade variant is yet to be uncovered. Here, we propose an integrative framework to address the lack of genomic samples and support the genomic signal from the full genome sequences of five heterozygous subjects by integrating it with subjects’ phenotypes and the genes’ molecular interactions. Our goal is to identify candidate thrombophilia-related genes for which our subjects possess germline variants by focusing on the resulting gene clusters of our integrative framework. We applied a Non-negative Matrix Tri-Factorization-based method to simultaneously integrate different data sources, taking into account the observed phenotypes. In other words, our data-integration framework reveals gene clusters involved with this rare disease by fusing different datasets. Our results are in concordance with the current literature about antithrombin resistance. We also found candidate disease-related genes that need to be further investigated. CD320, RTEL1, UCP2, APOA5 and PROZ participate in healthy-specific or disease-specific subnetworks involving thrombophilia-annotated genes and are related to general thrombophilia mechanisms according to the literature. Moreover, the ADRA2A and TBXA2R subnetworks analysis suggested that their variants may have a protective effect due to their connection with decreased platelet activation. The results show that our method can give insights into antithrombin resistance even if a small amount of genetic data is available. Our framework is also customizable, meaning that it applies to any other rare disease.

As presented in Section Materials and methods of the main paper, the Simultaneous Orthogonal Non-negative Matrix Tri-Factorization, SONMTF, can be formulated as the following minimization problem: min P,S,G,U i ≥0 f (P, S, G, U i ) = min P,S,G,U i ≥0 s.t. P T P = I and G T G = I, where F denotes the Frobenius norm, M is a matrix containing the germline variant profiles of the subjects, R 1 ,R 2 ,R 3 represent the adjacency matrices of PPI, COEX and GI molecular networks, respectively, P is a matrix relating n s subjects to n g genes, S is interpreted as the compressed representation of the molecular profiles, G is interpreted as the cluster indicator matrix of genes, and U i is interpreted as the compressed representation of each molecular network. Note that, as explained in the next section, P is a fixed matrix factor.
Following the semi-NMTF simplification [1] for a more computationally tractable solution, we remove the non-negativity constraint on S, U i ≥ 0. To solve the optimization problem, we derive the Karush-Kuhn-Tucker (KKT) conditions for our SONMTF as follows: where ⊙ is the Hadamard (element wise) product and matrix η is the dual variable for the primal constraint G ≥ 0. Because adjacency matrices R i are symmetric, therefore matrices U i are symmetric, too. For U i and S, we have closed formulas: As explained in [2], we derive the following multiplicative update rule to solve the KKT conditions above: We start from an initial solution, G init , and iteratively use Equations (1) and (2) to compute new matrix factors U i , S and G until convergence. To generate initial G init , we use the Singular Value Decomposition based strategy [3]. This strategy makes the solver deterministic and also reduces the number of iterations that are needed to achieve convergence [3].
We measure the quality of the factorization by sum of the relative square errors (RSE) between the decomposed matrices and the corresponding decompositions: In our implementation, the iterative solver stops after 1000 iterations, the value for which the RSE of the decomposition is not decreasing anymore.

Subject stratification and gene clusters
In the main paper, we present the generic SONMTF framework used for integrating germline variants, protein-protein interactions, co-expressions and genetic interaction data. The outputs of the SONMTF algorithm are interesting gene clusters that take into account the phenotypic differences between diseased subjects and the healthy carrier. For this reason, in the first run of our data-integration framework, we set the number of subject clusters to two. By default, solving SONMTF leads to a subject stratification (from matrix factor P , see Section Materials and A phenotype driven integrative framework uncovers molecular mechanisms of a rare hereditary thrombophilia methods of the main paper) that is best supported by the variant profiles of the subjects and by the considered molecular networks. In our case, subjects are grouped according to their family relationships, which is expected (see Figure 1, panel A). To account for the observed phenotypes of the subjects, it is possible to enforce the subject stratification (i.e., to force the diseased subjects to be in the same cluster and the healthy subject to be in a different cluster) by fixing matrix factor P (see Figure 1, panel B). For sanity check, for each of the two runs (when fixing or not the subject stratification), we extract the corresponding clusters of genes and measure their biological coherence by the percentage of them that are significantly enriched in at least one biological annotation.
As presented in Figure 2, while forcing the subject stratification slightly reduces the functional enrichment of the obtained clusters of genes, our clusterings are highly biologically coherent.

Default
Fixed P

A) B)
Figure S 1: Cluster indicator matrices for the subjects. A: The cluster indicator matrix, P , that is obtained when solving the default SONMTF (see Section Methods of main paper). It groups subjects B1, B2, and D1 into cluster 1, and subjects S1 and S2 into cluster 2. B: The cluster indicator matrix, P , that is kept fixed in order to group together the diseased subjects (B1, D1, S1, and S2) into cluster 1, and the healthy subject (B2) into cluster 2.
On the one hand, the clusters of genes that are obtained with the default solver lead to variant profiles (percentages of genes with variants per cluster) that are very similar across subjects (Figure 3, panel A), while the clusters of genes that are obtained when fixing P lead to variant profiles that are different for healthy and diseased subjects (Figure 3, panel B). Importantly, fixing P leads to clusters of genes that better separate healthy and disease-specific variants (Figure 3, panels C and D).
To test the robustness of our method to data imbalance, we make pairwise runs of our SON- MTF data integration framework using pairs of patients, i.e., B1-B2, D1-B2, S1-B2 and S2-B2. All the corresponding gene clusters are in agreement with the ones obtained when using all five individuals together, with Rand indices ranging from 88.7% to 90.1%. All these large agreements are statistically significant, with permutation-based p-values (using 100,000 permutations) all smaller than 10 −5 . We also checked these agreements at the individual cluster level. For example, 86.7% to 95.3% of the genes from the ADRA2A-TBXA2R cluster are also grouped together in the same cluster in the pairwise clusterings. Hence, our methodology and results are robust to the data imbalance.

Analysis of the germline variants
As a sanity check, we assess the relevance of the genes with variants of our five subjects with thrombophilia by investigating their known associated phenotypes in DisGeNet v6.0 [4]. These genes with variants are annotated with 492 phenotype annotations. In particular, the majority of the identified phenotypes are classified as "Laboratory Procedure" semantic type (123 phenotypes, out of 492), which are all related to blood tests, e.g., Blood Protein Measurement, Corpuscular Hemoglobin concentration Mean and Triglycerides Measurement. We further analyze the phenotypes classified as "Disease or Syndrome" by performing a systematic literature search in PubMed database [5]. We automatically retrieve the number of scientific publications that associate each phenotype to thrombophilia by searching for co-occurrences between the two in PubMed. We call thrombophilia-related phenotypes those ones that co-occur at least one time with thrombophilia.
We find that our gene variants are associated with 125 thrombophilia-related phenotypes out of 492, e.g., Rheumatoid Arthritis, Diabetes Mellitus, Non-Insulin-Dependent and Leukemia Myelocytic Acute. These findings suggest that considered gene variants are highly related to blood tests and disorders (50.4% of the phenotype annotations found for our gene variants are either thrombophiliarelated or associated with blood tests), which is in accordance with thrombophilia-related tests and risk of thrombosis. Indeed, both Rheumatoid Arthritis and Diabetes Mellitus are autoimmune diseases that are linked to an increased risk of venous thrombosis [6]. To assess the relevance of the observation that thrombophilia co-occurs with 125 of the "Disease or Syndrome" phenotypes that annotate the subjects' genes with variants in PubMed articles, we also measure the co-occurrences between any two of these "Disease or Syndrome" phenotypes in PubMed articles. We find that thrombophilia co-occurs more with the phenotypes that annotate the subjects' genes with variants in PubMed articles than 73% of the considered "Disease or Syndrome" phenotypes. This further suggests that the genes with variants of our five subjects are indeed related to thrombophilia.
Moreover, we analyze the impact of our germline variants using the PhD-SNPg method [7]. We use PhD-SNPg to extract pathogenic variants and compare these results with our findings. We find that only 173 out of the 17,104 genes considered in our study are predicted as pathogenic variants (p-values< 0.05). Moreover, we compute a hypergeometric test to check if these pathogenic variants are over-represented in our reported clusters. The results show that the cluster containing F2 also contains 14 predicted pathogenic variants and F12/TGFB1 cluster contains 12 predicted pathogenic variants out of 695 (hypergeometric p-values< 0.05). Instead, the disease-specific subnetwork contains six predicted pathogenic variants out of 461, and the healthy-specific subnetwork does not contain them, which is in accordance with its healthy specificity. Interestingly, PhD-SNPg predicted pathogenic variants are not annotated as thrombophilia genes.
In the main manuscript, we find some candidate genes that need to be further investigated. However, the majority of these genes have already been annotated for other diseases (17 out of 20).
We report in Table 2 their disease associations found using DisGeNet. Interestingly, some of the diseases associated with our candidate genes are related to blood conditions.